This notebook is a template with each step that you need to complete for the project.
Please fill in your code where there are explicit ? markers in the notebook. You are welcome to add more cells and code as you see fit.
Once you have completed all the code implementations, please export your notebook as an HTML file so the reviewers can view your code. Make sure all cell outputs are rendered correctly.
File-> Export Notebook As... -> Export Notebook as HTML
There is a writeup to complete as well after all code implementation is done. Please answer all questions and attach the necessary tables and charts. You can complete the writeup in either markdown or PDF.
Completing the code template and writeup template will cover all of the rubric points for this project.
The rubric contains "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. The stand out suggestions are optional. If you decide to pursue the "stand out suggestions", you can include the code in this notebook and also discuss the results in the writeup file.
Below is an example of the steps to get the API username and key. Each student will have their own username and key.
Download the kaggle.json file and use the username and key it contains.
ml.t3.medium instance (2 vCPU + 4 GiB)
Python 3 (MXNet 1.8 Python 3.7 CPU Optimized)
!pip install pydantic==1.10.2
!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0" bokeh==2.0.1
!pip install autogluon --no-cache-dir
# Without --no-cache-dir, smaller aws instances may have trouble installing
!pip install -U python-dotenv
!pip install -U kaggle
!pip install -U pandas-profiling
!pip install ipywidgets==7.7.2
# create the .kaggle directory and an empty kaggle.json file
!mkdir -p /root/.kaggle
!touch /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
from dotenv import load_dotenv
from os import environ
load_dotenv()
True
# Fill in your user name and key from creating the kaggle account and API token file
import json
kaggle_username = environ.get("KAGGLE_USERNAME")
kaggle_key = environ.get("KAGGLE_KEY")
# Save API token the kaggle.json file
with open("/root/.kaggle/kaggle.json", "w") as f:
    f.write(json.dumps({"username": kaggle_username, "key": kaggle_key}))
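If either environment variable is unset, `environ.get` returns `None` and the file above would be written with null fields. A small guard can fail fast instead (a minimal sketch; `require_credentials` is a hypothetical helper, not part of the Kaggle API):

```python
def require_credentials(username, key):
    """Raise early if either Kaggle credential is missing or empty,
    instead of silently writing a kaggle.json with null fields."""
    missing = [name for name, value in
               [("KAGGLE_USERNAME", username), ("KAGGLE_KEY", key)]
               if not value]
    if missing:
        raise RuntimeError("Missing Kaggle credentials: " + ", ".join(missing))
    return {"username": username, "key": key}
```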
# Download the dataset, it will be in a .zip file so you'll need to unzip it as well.
#!kaggle competitions download -c bike-sharing-demand
# If you already downloaded it you can use the -o command to overwrite the file
!unzip -o bike-sharing-demand.zip
Archive:  bike-sharing-demand.zip
  inflating: sampleSubmission.csv
  inflating: test.csv
  inflating: train.csv
import pandas as pd
from autogluon.tabular import TabularPredictor
import bokeh
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
# Create the train dataset in pandas by reading the csv
# Set the parsing of the datetime column so you can use some of the `dt` features in pandas later
train = pd.read_csv("train.csv", parse_dates=["datetime"])
train.head()
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64
 2   holiday     10886 non-null  int64
 3   workingday  10886 non-null  int64
 4   weather     10886 non-null  int64
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64
 10  registered  10886 non-null  int64
 11  count       10886 non-null  int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB
# Simple output of the train dataset to view the min/max/variation of the dataset features.
train.describe()
| | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.00000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 |
| mean | 2.506614 | 0.028569 | 0.680875 | 1.418427 | 20.23086 | 23.655084 | 61.886460 | 12.799395 | 36.021955 | 155.552177 | 191.574132 |
| std | 1.116174 | 0.166599 | 0.466159 | 0.633839 | 7.79159 | 8.474601 | 19.245033 | 8.164537 | 49.960477 | 151.039033 | 181.144454 |
| min | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.82000 | 0.760000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 13.94000 | 16.665000 | 47.000000 | 7.001500 | 4.000000 | 36.000000 | 42.000000 |
| 50% | 3.000000 | 0.000000 | 1.000000 | 1.000000 | 20.50000 | 24.240000 | 62.000000 | 12.998000 | 17.000000 | 118.000000 | 145.000000 |
| 75% | 4.000000 | 0.000000 | 1.000000 | 2.000000 | 26.24000 | 31.060000 | 77.000000 | 16.997900 | 49.000000 | 222.000000 | 284.000000 |
| max | 4.000000 | 1.000000 | 1.000000 | 4.000000 | 41.00000 | 45.455000 | 100.000000 | 56.996900 | 367.000000 | 886.000000 | 977.000000 |
# Create the test pandas dataframe in pandas by reading the csv, remember to parse the datetime!
test = pd.read_csv("test.csv", parse_dates=["datetime"])
test.head()
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 3 | 2011-01-20 03:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
| 4 | 2011-01-20 04:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
# Read the sample submission the same way as the train and test datasets
submission = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission.head()
| | datetime | count |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0 |
| 1 | 2011-01-20 01:00:00 | 0 |
| 2 | 2011-01-20 02:00:00 | 0 |
| 3 | 2011-01-20 03:00:00 | 0 |
| 4 | 2011-01-20 04:00:00 | 0 |
Requirements:
- The target we are predicting is `count`, so it is the label we are setting.
- Ignore the `casual` and `registered` columns, as they are also not present in the test dataset.
- Use `root_mean_squared_error` as the metric for evaluation.
- Use the `best_quality` preset to focus on creating the best model.

learner_kwargs = {
    "ignored_columns": ["casual", "registered"]
}
predictor = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression",
eval_metric="root_mean_squared_error").fit(train_data=train, time_limit=600, presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20230104_020241/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20230104_020241/"
AutoGluon Version: 0.6.1
Python Version: 3.7.10
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows: 10886
Train Data Columns: 11
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered']
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 3070.91 MB
Train Data (Original) Memory Usage: 0.78 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
good_rows = series[~series.isin(bad_rows)].astype(np.int64)
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('datetime', []) : 1 | ['datetime']
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 5 | ['season', 'holiday', 'workingday', 'weather', 'humidity']
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 3 | ['season', 'weather', 'humidity']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
0.5s = Fit runtime
9 features in original data used to generate 13 features in processed data.
Train Data (Processed) Memory Usage: 0.98 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.54s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.54s of the 599.45s of remaining time.
-101.5462 = Validation score (-root_mean_squared_error)
0.03s = Training runtime
0.1s = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 396.26s of the 596.17s of remaining time.
-84.1251 = Validation score (-root_mean_squared_error)
0.03s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 395.9s of the 595.81s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-131.4609 = Validation score (-root_mean_squared_error)
64.13s = Training runtime
5.95s = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 321.02s of the 520.94s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-131.0542 = Validation score (-root_mean_squared_error)
29.44s = Training runtime
1.29s = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 287.3s of the 487.22s of remaining time.
-116.5443 = Validation score (-root_mean_squared_error)
10.62s = Training runtime
0.52s = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 273.53s of the 473.44s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-130.5034 = Validation score (-root_mean_squared_error)
197.78s = Training runtime
0.09s = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 72.0s of the 271.91s of remaining time.
-124.5881 = Validation score (-root_mean_squared_error)
4.85s = Training runtime
0.51s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 63.99s of the 263.9s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-137.5911 = Validation score (-root_mean_squared_error)
77.11s = Training runtime
0.41s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 182.66s of remaining time.
-84.1251 = Validation score (-root_mean_squared_error)
0.49s = Training runtime
0.0s = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 182.1s of the 182.08s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-60.2855 = Validation score (-root_mean_squared_error)
55.13s = Training runtime
3.4s = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 121.27s of the 121.25s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-55.161 = Validation score (-root_mean_squared_error)
24.69s = Training runtime
0.22s = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 92.49s of the 92.47s of remaining time.
-53.3704 = Validation score (-root_mean_squared_error)
26.18s = Training runtime
0.6s = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 63.25s of the 63.24s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-55.6524 = Validation score (-root_mean_squared_error)
62.87s = Training runtime
0.06s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -3.59s of remaining time.
-53.0732 = Validation score (-root_mean_squared_error)
0.28s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 604.06s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230104_020241/")
# Get detailed info of the predictor
pred_info = predictor.info()
with open("docs/pred_info.json", "w") as convert_file:
    convert_file.write(json.dumps(pred_info, default=str))
#from bokeh.plotting import figure, show
#from bokeh.io import output_notebook
#output_notebook()
predictor.fit_summary(show_plot=False)
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -53.073174 13.265574 553.142472 0.000775 0.277415 3 True 14
1 RandomForestMSE_BAG_L2 -53.370416 9.581156 410.168097 0.600952 26.182611 2 True 12
2 LightGBM_BAG_L2 -55.160954 9.202732 408.676177 0.222528 24.690691 2 True 11
3 CatBoost_BAG_L2 -55.652386 9.038394 446.859077 0.058190 62.873591 2 True 13
4 LightGBMXT_BAG_L2 -60.285482 12.383130 439.118164 3.402925 55.132678 2 True 10
5 KNeighborsDist_BAG_L1 -84.125061 0.103782 0.029244 0.103782 0.029244 1 True 2
6 WeightedEnsemble_L2 -84.125061 0.104525 0.522287 0.000743 0.493043 2 True 9
7 KNeighborsUnif_BAG_L1 -101.546199 0.104947 0.030970 0.104947 0.030970 1 True 1
8 RandomForestMSE_BAG_L1 -116.544294 0.521331 10.616067 0.521331 10.616067 1 True 5
9 ExtraTreesMSE_BAG_L1 -124.588053 0.513939 4.845773 0.513939 4.845773 1 True 7
10 CatBoost_BAG_L1 -130.503441 0.092072 197.782267 0.092072 197.782267 1 True 6
11 LightGBM_BAG_L1 -131.054162 1.289819 29.443498 1.289819 29.443498 1 True 4
12 LightGBMXT_BAG_L1 -131.460909 5.948354 64.125232 5.948354 64.125232 1 True 3
13 NeuralNetFastAI_BAG_L1 -137.591119 0.405960 77.112434 0.405960 77.112434 1 True 8
Number of models trained: 14
Types of models trained:
{'StackerEnsembleModel_KNN', 'StackerEnsembleModel_LGB', 'StackerEnsembleModel_XT', 'StackerEnsembleModel_NNFastAiTabular', 'WeightedEnsembleModel', 'StackerEnsembleModel_CatBoost', 'StackerEnsembleModel_RF'}
Bagging used: True (with 8 folds)
Multi-layer stack-ensembling used: True (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 3 | ['season', 'weather', 'humidity']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20230104_020241/SummaryOfModels.html
*** End of fit() summary ***
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
'WeightedEnsemble_L2': 'WeightedEnsembleModel',
'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
'KNeighborsDist_BAG_L1': -84.12506123181602,
'LightGBMXT_BAG_L1': -131.46090891834504,
'LightGBM_BAG_L1': -131.054161598899,
'RandomForestMSE_BAG_L1': -116.54429428704391,
'CatBoost_BAG_L1': -130.50344119744508,
'ExtraTreesMSE_BAG_L1': -124.58805258915959,
'NeuralNetFastAI_BAG_L1': -137.59111927600816,
'WeightedEnsemble_L2': -84.12506123181602,
'LightGBMXT_BAG_L2': -60.285481674376115,
'LightGBM_BAG_L2': -55.160953725178764,
'RandomForestMSE_BAG_L2': -53.37041620071757,
'CatBoost_BAG_L2': -55.65238600039221,
'WeightedEnsemble_L3': -53.0731743886261},
'model_best': 'WeightedEnsemble_L3',
'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/KNeighborsUnif_BAG_L1/',
'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/KNeighborsDist_BAG_L1/',
'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/LightGBMXT_BAG_L1/',
'LightGBM_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/LightGBM_BAG_L1/',
'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/RandomForestMSE_BAG_L1/',
'CatBoost_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/CatBoost_BAG_L1/',
'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/ExtraTreesMSE_BAG_L1/',
'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/NeuralNetFastAI_BAG_L1/',
'WeightedEnsemble_L2': 'AutogluonModels/ag-20230104_020241/models/WeightedEnsemble_L2/',
'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20230104_020241/models/LightGBMXT_BAG_L2/',
'LightGBM_BAG_L2': 'AutogluonModels/ag-20230104_020241/models/LightGBM_BAG_L2/',
'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20230104_020241/models/RandomForestMSE_BAG_L2/',
'CatBoost_BAG_L2': 'AutogluonModels/ag-20230104_020241/models/CatBoost_BAG_L2/',
'WeightedEnsemble_L3': 'AutogluonModels/ag-20230104_020241/models/WeightedEnsemble_L3/'},
'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.030969619750976562,
'KNeighborsDist_BAG_L1': 0.029244422912597656,
'LightGBMXT_BAG_L1': 64.12523245811462,
'LightGBM_BAG_L1': 29.443498373031616,
'RandomForestMSE_BAG_L1': 10.616066694259644,
'CatBoost_BAG_L1': 197.78226709365845,
'ExtraTreesMSE_BAG_L1': 4.845773220062256,
'NeuralNetFastAI_BAG_L1': 77.11243438720703,
'WeightedEnsemble_L2': 0.4930429458618164,
'LightGBMXT_BAG_L2': 55.13267803192139,
'LightGBM_BAG_L2': 24.690690755844116,
'RandomForestMSE_BAG_L2': 26.182611227035522,
'CatBoost_BAG_L2': 62.87359070777893,
'WeightedEnsemble_L3': 0.27741503715515137},
'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10494709014892578,
'KNeighborsDist_BAG_L1': 0.10378217697143555,
'LightGBMXT_BAG_L1': 5.948354005813599,
'LightGBM_BAG_L1': 1.2898194789886475,
'RandomForestMSE_BAG_L1': 0.5213305950164795,
'CatBoost_BAG_L1': 0.0920724868774414,
'ExtraTreesMSE_BAG_L1': 0.5139386653900146,
'NeuralNetFastAI_BAG_L1': 0.4059596061706543,
'WeightedEnsemble_L2': 0.0007426738739013672,
'LightGBMXT_BAG_L2': 3.402925491333008,
'LightGBM_BAG_L2': 0.22252774238586426,
'RandomForestMSE_BAG_L2': 0.6009519100189209,
'CatBoost_BAG_L2': 0.05818963050842285,
'WeightedEnsemble_L3': 0.0007748603820800781},
'num_bag_folds': 8,
'max_stack_level': 3,
'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'KNeighborsDist_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'LightGBMXT_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L2': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBMXT_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L3': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True}},
'leaderboard': model score_val pred_time_val fit_time \
0 WeightedEnsemble_L3 -53.073174 13.265574 553.142472
1 RandomForestMSE_BAG_L2 -53.370416 9.581156 410.168097
2 LightGBM_BAG_L2 -55.160954 9.202732 408.676177
3 CatBoost_BAG_L2 -55.652386 9.038394 446.859077
4 LightGBMXT_BAG_L2 -60.285482 12.383130 439.118164
5 KNeighborsDist_BAG_L1 -84.125061 0.103782 0.029244
6 WeightedEnsemble_L2 -84.125061 0.104525 0.522287
7 KNeighborsUnif_BAG_L1 -101.546199 0.104947 0.030970
8 RandomForestMSE_BAG_L1 -116.544294 0.521331 10.616067
9 ExtraTreesMSE_BAG_L1 -124.588053 0.513939 4.845773
10 CatBoost_BAG_L1 -130.503441 0.092072 197.782267
11 LightGBM_BAG_L1 -131.054162 1.289819 29.443498
12 LightGBMXT_BAG_L1 -131.460909 5.948354 64.125232
13 NeuralNetFastAI_BAG_L1 -137.591119 0.405960 77.112434
pred_time_val_marginal fit_time_marginal stack_level can_infer \
0 0.000775 0.277415 3 True
1 0.600952 26.182611 2 True
2 0.222528 24.690691 2 True
3 0.058190 62.873591 2 True
4 3.402925 55.132678 2 True
5 0.103782 0.029244 1 True
6 0.000743 0.493043 2 True
7 0.104947 0.030970 1 True
8 0.521331 10.616067 1 True
9 0.513939 4.845773 1 True
10 0.092072 197.782267 1 True
11 1.289819 29.443498 1 True
12 5.948354 64.125232 1 True
13 0.405960 77.112434 1 True
fit_order
0 14
1 12
2 11
3 13
4 10
5 2
6 9
7 1
8 5
9 7
10 6
11 4
12 3
13 8 }
predictor.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
<AxesSubplot:xlabel='model'>
# Save validation scores
leaderboard = predictor.leaderboard()
leaderboard["description"] = "baseline with raw features"
leaderboard.to_csv("docs/leaderboard.csv", index=False)
    model                   score_val    pred_time_val  fit_time    pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L3     -53.073174   13.265574      553.142472  0.000775                0.277415           3            True       14
1   RandomForestMSE_BAG_L2  -53.370416   9.581156       410.168097  0.600952                26.182611          2            True       12
2   LightGBM_BAG_L2         -55.160954   9.202732       408.676177  0.222528                24.690691          2            True       11
3   CatBoost_BAG_L2         -55.652386   9.038394       446.859077  0.058190                62.873591          2            True       13
4   LightGBMXT_BAG_L2       -60.285482   12.383130      439.118164  3.402925                55.132678          2            True       10
5   KNeighborsDist_BAG_L1   -84.125061   0.103782       0.029244    0.103782                0.029244           1            True       2
6   WeightedEnsemble_L2     -84.125061   0.104525       0.522287    0.000743                0.493043           2            True       9
7   KNeighborsUnif_BAG_L1   -101.546199  0.104947       0.030970    0.104947                0.030970           1            True       1
8   RandomForestMSE_BAG_L1  -116.544294  0.521331       10.616067   0.521331                10.616067          1            True       5
9   ExtraTreesMSE_BAG_L1    -124.588053  0.513939       4.845773    0.513939                4.845773           1            True       7
10  CatBoost_BAG_L1         -130.503441  0.092072       197.782267  0.092072                197.782267         1            True       6
11  LightGBM_BAG_L1         -131.054162  1.289819       29.443498   1.289819                29.443498          1            True       4
12  LightGBMXT_BAG_L1       -131.460909  5.948354       64.125232   5.948354                64.125232          1            True       3
13  NeuralNetFastAI_BAG_L1  -137.591119  0.405960       77.112434   0.405960                77.112434          1            True       8
predictions = predictor.predict(test)
predictions.head()
0    23.979008
1    41.106430
2    45.552490
3    48.853279
4    51.996368
Name: count, dtype: float32
# Describe the `predictions` series to see if there are any negative values
predictions.describe()
count    6493.000000
mean      100.831001
std        89.846375
min         3.146910
25%        20.709923
50%        63.959476
75%       168.133774
max       364.188293
Name: count, dtype: float64
# How many negative values do we have?
predictions_df = pd.DataFrame(predictions)
count_neg = len(predictions_df[predictions_df["count"] < 0])
# Set them to zero
if count_neg > 0:
    predictions_df.loc[predictions_df["count"] < 0, "count"] = 0
    print("{} negative predictions were set to zero".format(count_neg))
    print(predictions_df[predictions_df["count"] == 0])
else:
    print("{} negative values were found".format(count_neg))
0 negative values were found
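Pandas' built-in `clip` offers a more concise, vectorized alternative to the conditional fix above (a sketch with toy values; behavior matches setting negatives to zero):

```python
import pandas as pd

# clip(lower=0) floors every negative prediction at zero in one vectorized
# call; non-negative values pass through unchanged.
preds = pd.Series([23.979008, -1.5, 0.0, 51.996368], name="count")
preds_clipped = preds.clip(lower=0)
```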
submission["count"] = predictions.round(0).astype(int)
submission.to_csv("submission.csv", index=False)
!kaggle competitions submit -c bike-sharing-demand -f submission.csv -m "first raw submission"
100%|█████████████████████████████████████████| 148k/148k [00:00<00:00, 264kB/s] Successfully submitted to Bike Sharing Demand
My Submissions
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName            date                 description                      status    publicScore  privateScore
------------------  -------------------  -------------------------------  --------  -----------  ------------
submission.csv      2023-01-04 02:17:18  first raw submission             complete  1.79200      1.79200
submission_hpo.csv  2023-01-04 01:59:41  model with new features and hpo  complete  0.47675      0.47675
submission_hpo.csv  2023-01-04 01:45:44  model with new features and hpo  complete  0.48014      0.48014
submission_hpo.csv  2023-01-04 01:33:22  model with new features and hpo  complete  0.50426      0.50426
tail: write error: Broken pipe
# Score: 1.79200
# Create a histogram of all features to show the distribution of each one. This is part of the exploratory data analysis
train.hist(figsize=(12, 10))
plt.show()
# Create a new feature
train["hour"] = train["datetime"].dt.hour
test["hour"] = test["datetime"].dt.hour
train.head()
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 1 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 3 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 4 |
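Other `dt` accessors can be extracted the same way as `hour` (a sketch; `add_time_features` is a hypothetical helper, and whether the extra calendar features actually help is an assumption to verify against validation scores):

```python
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar features from an already-parsed 'datetime' column."""
    df = df.copy()
    df["hour"] = df["datetime"].dt.hour
    df["dayofweek"] = df["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
    df["month"] = df["datetime"].dt.month
    return df
```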
# Profiler report (train data)
profile = ProfileReport(train)
profile.to_notebook_iframe()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
# Visualizations
# Distribution of hourly bike demand by time features
train.groupby([train["datetime"].dt.month, "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by month (train data)")
train.groupby([train["datetime"].dt.hour, "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by hour (train data)")
train.groupby([train["datetime"].dt.dayofweek, "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by dayofweek (train data)")
plt.show()
train.groupby(["holiday"])["count"].median().plot(
kind='bar', title="Median of hourly bike demand by holiday (train data)")
plt.show()
# Distribution of hourly bike demand by weather features
train.groupby(["season", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by season (train data)")
train.groupby(["weather", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by weather (train data)")
train.groupby(["temp", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by temp (train data)")
train.groupby(["atemp", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by atemp (train data)")
train.groupby(["windspeed", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by windspeed (train data)")
train.groupby(["humidity", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by humidity (train data)")
plt.show()
# Distribution of events by categorical features
train["season"].value_counts().plot(
kind='bar', title="Number of events by season (train data)")
plt.show()
train["weather"].value_counts().plot(
kind='bar', title="Number of events by weather (train data)")
plt.show()
train["holiday"].value_counts().plot(
kind='bar', title="Number of events by holiday (train data)")
plt.show()
train["workingday"].value_counts().plot(
kind='bar', title="Number of events by workingday (train data)")
plt.show()
display(train[train['weather'] == 4])
display(test[test['weather'] == 4])
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5631 | 2012-01-09 18:00:00 | 1 | 0 | 1 | 4 | 8.2 | 11.365 | 86 | 6.0032 | 6 | 158 | 164 | 18 |
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | hour |
|---|---|---|---|---|---|---|---|---|---|---|
| 154 | 2011-01-26 16:00:00 | 1 | 0 | 1 | 4 | 9.02 | 9.85 | 93 | 22.0028 | 16 |
| 3248 | 2012-01-21 01:00:00 | 1 | 0 | 0 | 4 | 5.74 | 6.82 | 86 | 12.9980 | 1 |
# As there are only 3 events in weather category 4 ("heavy rain") across train and test, those values are replaced with category 3 ("light rain")
train.loc[train['weather'] == 4, 'weather'] = 3
test.loc[test['weather'] == 4, 'weather'] = 3
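The remap can also be wrapped in a small helper that returns a copy, which keeps the transformation easy to test (a minimal sketch; `remap_rare_weather` is a hypothetical name):

```python
import pandas as pd

def remap_rare_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the rare 'heavy rain' category (4) into 'light rain' (3)."""
    df = df.copy()
    df.loc[df["weather"] == 4, "weather"] = 3
    return df
```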
# Functions for generating new feature values
def get_time_of_day(hour):
    if 7 <= hour <= 9:
        return "morning"
    elif 12 <= hour <= 15:
        return "lunch"
    elif 16 <= hour <= 19:
        return "rush_hour"
    elif 20 <= hour <= 23:
        return "night"
    else:
        return "other"

def get_tempcat(temp):
    if temp >= 35:
        return "very hot"
    elif 25 <= temp < 35:
        return "hot"
    elif 15 <= temp < 25:
        return "warm"
    elif 10 <= temp < 15:
        return "cool"
    else:
        return "cold"

def get_windcat(windspeed):
    if windspeed > 20:
        return "windy"
    elif 10 < windspeed <= 20:
        return "mild"
    else:
        return "low"

def get_humiditycat(humidity):
    if humidity >= 80:
        return "high"
    elif 40 < humidity < 80:
        return "mild"
    else:
        return "low"
# New features are generated
train["time_of_day"] = train['hour'].apply(get_time_of_day)
test['time_of_day'] = test['hour'].apply(get_time_of_day)
train['atempcat'] = train['atemp'].apply(get_tempcat)
test['atempcat'] = test['atemp'].apply(get_tempcat)
train['tempcat'] = train['temp'].apply(get_tempcat)
test['tempcat'] = test['temp'].apply(get_tempcat)
train['windcat'] = train['windspeed'].apply(get_windcat)
test['windcat'] = test['windspeed'].apply(get_windcat)
train['humiditycat'] = train['humidity'].apply(get_humiditycat)
test['humiditycat'] = test['humidity'].apply(get_humiditycat)
train.head()
|   | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | hour | time_of_day | atempcat | tempcat | windcat | humiditycat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 | other | cool | cold | low | high |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 1 | other | cool | cold | low | high |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2 | other | cool | cold | low | high |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 3 | other | cool | cold | low | mild |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 4 | other | cool | cold | low | mild |
# Plot new categories
train["time_of_day"].value_counts().plot(
kind='bar', title="Number of events by time_of_day (train data)")
plt.show()
train["atempcat"].value_counts().plot(
kind='bar', title="Number of events by atempcat (train data)")
plt.show()
train["tempcat"].value_counts().plot(
kind='bar', title="Number of events by tempcat (train data)")
plt.show()
train["windcat"].value_counts().plot(
kind='bar', title="Number of events by windcat (train data)")
plt.show()
train["humiditycat"].value_counts().plot(
kind='bar', title="Number of events by humiditycat (train data)")
plt.show()
category_list = ["season", "weather", "holiday", "workingday"]
train[category_list] = train[category_list].astype("category")
test[category_list] = test[category_list].astype("category")
new_category_list = ["time_of_day", "atempcat", "windcat", "humiditycat", "tempcat"]
train[new_category_list] = train[new_category_list].astype("category")
test[new_category_list] = test[new_category_list].astype("category")
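Besides signalling to AutoGluon that these columns are categorical, the `category` dtype stores small integer codes plus one copy of the levels, so low-cardinality columns shrink substantially. A sketch with a hypothetical column similar to the season/weather codes:

```python
import pandas as pd

# Hypothetical low-cardinality column, similar to the season/weather codes
s_int = pd.Series([1, 2, 3, 4] * 25000)
s_cat = s_int.astype("category")

# Category storage: int8 codes + one copy of the 4 levels,
# versus 8 bytes per row for the int64 original
assert s_cat.memory_usage(deep=True) < s_int.memory_usage(deep=True)
print(s_int.memory_usage(deep=True), s_cat.memory_usage(deep=True))
```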
# View the new features
train.info()
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   datetime     10886 non-null  datetime64[ns]
 1   season       10886 non-null  category
 2   holiday      10886 non-null  category
 3   workingday   10886 non-null  category
 4   weather      10886 non-null  category
 5   temp         10886 non-null  float64
 6   atemp        10886 non-null  float64
 7   humidity     10886 non-null  int64
 8   windspeed    10886 non-null  float64
 9   casual       10886 non-null  int64
 10  registered   10886 non-null  int64
 11  count        10886 non-null  int64
 12  hour         10886 non-null  int64
 13  time_of_day  10886 non-null  category
 14  atempcat     10886 non-null  category
 15  tempcat      10886 non-null  category
 16  windcat      10886 non-null  category
 17  humiditycat  10886 non-null  category
dtypes: category(9), datetime64[ns](1), float64(3), int64(5)
memory usage: 862.7 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   datetime     6493 non-null   datetime64[ns]
 1   season       6493 non-null   category
 2   holiday      6493 non-null   category
 3   workingday   6493 non-null   category
 4   weather      6493 non-null   category
 5   temp         6493 non-null   float64
 6   atemp        6493 non-null   float64
 7   humidity     6493 non-null   int64
 8   windspeed    6493 non-null   float64
 9   hour         6493 non-null   int64
 10  time_of_day  6493 non-null   category
 11  atempcat     6493 non-null   category
 12  tempcat      6493 non-null   category
 13  windcat      6493 non-null   category
 14  humiditycat  6493 non-null   category
dtypes: category(9), datetime64[ns](1), float64(3), int64(2)
memory usage: 363.0 KB
# View histogram of all features again now with the hour feature
train.hist(figsize=(10, 8))
plt.show()
train.head()
|   | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | hour | time_of_day | atempcat | tempcat | windcat | humiditycat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 | other | cool | cold | low | high |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 1 | other | cool | cold | low | high |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2 | other | cool | cold | low | high |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 3 | other | cool | cold | low | mild |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 4 | other | cool | cold | low | mild |
# Fit model; ignore casual/registered (they sum to the target) and the raw numeric columns replaced by the categorical bins
learner_kwargs = {
"ignored_columns": ["casual", "registered", "atemp", "windspeed", "humidity", "temp"]
}
predictor_new_features = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression",
eval_metric="root_mean_squared_error").fit(train_data=train, time_limit=600, presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20230104_022345/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20230104_022345/"
AutoGluon Version: 0.6.1
Python Version: 3.7.10
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows: 10886
Train Data Columns: 17
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered', 'atemp', 'windspeed', 'humidity', 'temp']
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 1929.77 MB
Train Data (Original) Memory Usage: 0.27 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
good_rows = series[~series.isin(bad_rows)].astype(np.int64)
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('category', []) : 9 | ['season', 'holiday', 'workingday', 'weather', 'time_of_day', ...]
('datetime', []) : 1 | ['datetime']
('int', []) : 1 | ['hour']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 7 | ['season', 'weather', 'time_of_day', 'atempcat', 'tempcat', ...]
('int', []) : 1 | ['hour']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
0.3s = Fit runtime
11 features in original data used to generate 15 features in processed data.
Train Data (Processed) Memory Usage: 0.62 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.36s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.66s of the 599.63s of remaining time.
-101.5462 = Validation score (-root_mean_squared_error)
0.02s = Training runtime
0.1s = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 399.31s of the 599.29s of remaining time.
-84.1251 = Validation score (-root_mean_squared_error)
0.02s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 398.97s of the 598.94s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-35.8791 = Validation score (-root_mean_squared_error)
77.22s = Training runtime
7.39s = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 313.14s of the 513.11s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-32.9854 = Validation score (-root_mean_squared_error)
54.48s = Training runtime
4.5s = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 253.71s of the 453.68s of remaining time.
-39.1691 = Validation score (-root_mean_squared_error)
9.88s = Training runtime
0.57s = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 240.74s of the 440.71s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-35.9861 = Validation score (-root_mean_squared_error)
205.83s = Training runtime
0.21s = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 30.9s of the 230.87s of remaining time.
-38.8958 = Validation score (-root_mean_squared_error)
5.56s = Training runtime
0.56s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 22.2s of the 222.17s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-75.0209 = Validation score (-root_mean_squared_error)
44.2s = Training runtime
0.49s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 173.69s of remaining time.
-32.2135 = Validation score (-root_mean_squared_error)
0.69s = Training runtime
0.0s = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 172.92s of the 172.9s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-31.9977 = Validation score (-root_mean_squared_error)
28.23s = Training runtime
0.65s = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 140.29s of the 140.27s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-30.7289 = Validation score (-root_mean_squared_error)
25.28s = Training runtime
0.28s = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 110.94s of the 110.92s of remaining time.
-32.1346 = Validation score (-root_mean_squared_error)
26.81s = Training runtime
0.61s = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 81.22s of the 81.2s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-31.1785 = Validation score (-root_mean_squared_error)
77.67s = Training runtime
0.14s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -0.38s of remaining time.
-30.5547 = Validation score (-root_mean_squared_error)
0.33s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 600.91s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230104_022345/")
# Get detailed info of the predictor
pred_nf_info = predictor_new_features.info()
with open('docs/pred_nf_info.json', 'w') as convert_file:
convert_file.write(json.dumps(pred_nf_info, default=str))
predictor_new_features.fit_summary(show_plot=False)
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -30.554656 15.607355 555.528752 0.001309 0.329112 3 True 14
1 LightGBM_BAG_L2 -30.728931 14.204675 422.493474 0.284346 25.281531 2 True 11
2 CatBoost_BAG_L2 -31.178451 14.059996 474.882658 0.139667 77.670714 2 True 13
3 LightGBMXT_BAG_L2 -31.997696 14.572388 425.438654 0.652059 28.226711 2 True 10
4 RandomForestMSE_BAG_L2 -32.134557 14.529974 424.020685 0.609645 26.808742 2 True 12
5 WeightedEnsemble_L2 -32.213506 12.771409 348.123248 0.001235 0.694335 2 True 9
6 LightGBM_BAG_L1 -32.985441 4.500878 54.476392 4.500878 54.476392 1 True 4
7 LightGBMXT_BAG_L1 -35.879059 7.386585 77.218651 7.386585 77.218651 1 True 3
8 CatBoost_BAG_L1 -35.986122 0.213159 205.830238 0.213159 205.830238 1 True 6
9 ExtraTreesMSE_BAG_L1 -38.895772 0.558258 5.564719 0.558258 5.564719 1 True 7
10 RandomForestMSE_BAG_L1 -39.169105 0.565517 9.879488 0.565517 9.879488 1 True 5
11 NeuralNetFastAI_BAG_L1 -75.020906 0.488890 44.199071 0.488890 44.199071 1 True 8
12 KNeighborsDist_BAG_L1 -84.125061 0.104034 0.024144 0.104034 0.024144 1 True 2
13 KNeighborsUnif_BAG_L1 -101.546199 0.103008 0.019241 0.103008 0.019241 1 True 1
Number of models trained: 14
Types of models trained:
{'StackerEnsembleModel_KNN', 'StackerEnsembleModel_LGB', 'StackerEnsembleModel_XT', 'StackerEnsembleModel_NNFastAiTabular', 'WeightedEnsembleModel', 'StackerEnsembleModel_CatBoost', 'StackerEnsembleModel_RF'}
Bagging used: True (with 8 folds)
Multi-layer stack-ensembling used: True (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 7 | ['season', 'weather', 'time_of_day', 'atempcat', 'tempcat', ...]
('int', []) : 1 | ['hour']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20230104_022345/SummaryOfModels.html
*** End of fit() summary ***
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
'WeightedEnsemble_L2': 'WeightedEnsembleModel',
'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
'KNeighborsDist_BAG_L1': -84.12506123181602,
'LightGBMXT_BAG_L1': -35.87905943622523,
'LightGBM_BAG_L1': -32.98544134736175,
'RandomForestMSE_BAG_L1': -39.16910479120037,
'CatBoost_BAG_L1': -35.98612160843292,
'ExtraTreesMSE_BAG_L1': -38.89577199128411,
'NeuralNetFastAI_BAG_L1': -75.02090615783968,
'WeightedEnsemble_L2': -32.213505766365564,
'LightGBMXT_BAG_L2': -31.997695966545237,
'LightGBM_BAG_L2': -30.728930885149015,
'RandomForestMSE_BAG_L2': -32.13455742507246,
'CatBoost_BAG_L2': -31.17845121244172,
'WeightedEnsemble_L3': -30.554655562394768},
'model_best': 'WeightedEnsemble_L3',
'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/KNeighborsUnif_BAG_L1/',
'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/KNeighborsDist_BAG_L1/',
'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/LightGBMXT_BAG_L1/',
'LightGBM_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/LightGBM_BAG_L1/',
'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/RandomForestMSE_BAG_L1/',
'CatBoost_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/CatBoost_BAG_L1/',
'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/ExtraTreesMSE_BAG_L1/',
'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/NeuralNetFastAI_BAG_L1/',
'WeightedEnsemble_L2': 'AutogluonModels/ag-20230104_022345/models/WeightedEnsemble_L2/',
'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20230104_022345/models/LightGBMXT_BAG_L2/',
'LightGBM_BAG_L2': 'AutogluonModels/ag-20230104_022345/models/LightGBM_BAG_L2/',
'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20230104_022345/models/RandomForestMSE_BAG_L2/',
'CatBoost_BAG_L2': 'AutogluonModels/ag-20230104_022345/models/CatBoost_BAG_L2/',
'WeightedEnsemble_L3': 'AutogluonModels/ag-20230104_022345/models/WeightedEnsemble_L3/'},
'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.019240617752075195,
'KNeighborsDist_BAG_L1': 0.02414417266845703,
'LightGBMXT_BAG_L1': 77.2186508178711,
'LightGBM_BAG_L1': 54.476391553878784,
'RandomForestMSE_BAG_L1': 9.879487752914429,
'CatBoost_BAG_L1': 205.83023834228516,
'ExtraTreesMSE_BAG_L1': 5.564719200134277,
'NeuralNetFastAI_BAG_L1': 44.19907069206238,
'WeightedEnsemble_L2': 0.6943349838256836,
'LightGBMXT_BAG_L2': 28.226710557937622,
'LightGBM_BAG_L2': 25.281530618667603,
'RandomForestMSE_BAG_L2': 26.8087420463562,
'CatBoost_BAG_L2': 77.67071437835693,
'WeightedEnsemble_L3': 0.32911157608032227},
'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10300803184509277,
'KNeighborsDist_BAG_L1': 0.1040341854095459,
'LightGBMXT_BAG_L1': 7.386584997177124,
'LightGBM_BAG_L1': 4.500877857208252,
'RandomForestMSE_BAG_L1': 0.5655171871185303,
'CatBoost_BAG_L1': 0.21315884590148926,
'ExtraTreesMSE_BAG_L1': 0.5582578182220459,
'NeuralNetFastAI_BAG_L1': 0.4888899326324463,
'WeightedEnsemble_L2': 0.0012354850769042969,
'LightGBMXT_BAG_L2': 0.6520588397979736,
'LightGBM_BAG_L2': 0.2843458652496338,
'RandomForestMSE_BAG_L2': 0.6096453666687012,
'CatBoost_BAG_L2': 0.13966703414916992,
'WeightedEnsemble_L3': 0.0013093948364257812},
'num_bag_folds': 8,
'max_stack_level': 3,
'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'KNeighborsDist_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'LightGBMXT_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L2': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBMXT_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L3': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True}},
'leaderboard': model score_val pred_time_val fit_time \
0 WeightedEnsemble_L3 -30.554656 15.607355 555.528752
1 LightGBM_BAG_L2 -30.728931 14.204675 422.493474
2 CatBoost_BAG_L2 -31.178451 14.059996 474.882658
3 LightGBMXT_BAG_L2 -31.997696 14.572388 425.438654
4 RandomForestMSE_BAG_L2 -32.134557 14.529974 424.020685
5 WeightedEnsemble_L2 -32.213506 12.771409 348.123248
6 LightGBM_BAG_L1 -32.985441 4.500878 54.476392
7 LightGBMXT_BAG_L1 -35.879059 7.386585 77.218651
8 CatBoost_BAG_L1 -35.986122 0.213159 205.830238
9 ExtraTreesMSE_BAG_L1 -38.895772 0.558258 5.564719
10 RandomForestMSE_BAG_L1 -39.169105 0.565517 9.879488
11 NeuralNetFastAI_BAG_L1 -75.020906 0.488890 44.199071
12 KNeighborsDist_BAG_L1 -84.125061 0.104034 0.024144
13 KNeighborsUnif_BAG_L1 -101.546199 0.103008 0.019241
pred_time_val_marginal fit_time_marginal stack_level can_infer \
0 0.001309 0.329112 3 True
1 0.284346 25.281531 2 True
2 0.139667 77.670714 2 True
3 0.652059 28.226711 2 True
4 0.609645 26.808742 2 True
5 0.001235 0.694335 2 True
6 4.500878 54.476392 1 True
7 7.386585 77.218651 1 True
8 0.213159 205.830238 1 True
9 0.558258 5.564719 1 True
10 0.565517 9.879488 1 True
11 0.488890 44.199071 1 True
12 0.104034 0.024144 1 True
13 0.103008 0.019241 1 True
fit_order
0 14
1 11
2 13
3 10
4 12
5 9
6 4
7 3
8 6
9 7
10 5
11 8
12 2
13 1 }
predictor_new_features.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
<AxesSubplot:xlabel='model'>
# Save validation scores
leaderboard_nf = predictor_new_features.leaderboard()
leaderboard_nf["description"] = "scores with new features"
leaderboard_nf.to_csv("docs/leaderboard_nf.csv", index=False)
                     model   score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      WeightedEnsemble_L3  -30.554656      15.607355  555.528752                0.001309           0.329112            3       True         14
1          LightGBM_BAG_L2  -30.728931      14.204675  422.493474                0.284346          25.281531            2       True         11
2          CatBoost_BAG_L2  -31.178451      14.059996  474.882658                0.139667          77.670714            2       True         13
3        LightGBMXT_BAG_L2  -31.997696      14.572388  425.438654                0.652059          28.226711            2       True         10
4   RandomForestMSE_BAG_L2  -32.134557      14.529974  424.020685                0.609645          26.808742            2       True         12
5      WeightedEnsemble_L2  -32.213506      12.771409  348.123248                0.001235           0.694335            2       True          9
6          LightGBM_BAG_L1  -32.985441       4.500878   54.476392                4.500878          54.476392            1       True          4
7        LightGBMXT_BAG_L1  -35.879059       7.386585   77.218651                7.386585          77.218651            1       True          3
8          CatBoost_BAG_L1  -35.986122       0.213159  205.830238                0.213159         205.830238            1       True          6
9     ExtraTreesMSE_BAG_L1  -38.895772       0.558258    5.564719                0.558258           5.564719            1       True          7
10  RandomForestMSE_BAG_L1  -39.169105       0.565517    9.879488                0.565517           9.879488            1       True          5
11  NeuralNetFastAI_BAG_L1  -75.020906       0.488890   44.199071                0.488890          44.199071            1       True          8
12   KNeighborsDist_BAG_L1  -84.125061       0.104034    0.024144                0.104034           0.024144            1       True          2
13   KNeighborsUnif_BAG_L1 -101.546199       0.103008    0.019241                0.103008           0.019241            1       True          1
predictions_nf = predictor_new_features.predict(test)
predictions_nf.head()
0    15.458272
1    11.300541
2    10.280220
3     9.255146
4     8.010801
Name: count, dtype: float32
predictions_nf.describe()
count    6493.000000
mean      154.990692
std       134.164993
min         2.731123
25%        50.973583
50%       121.053001
75%       217.896561
max       789.729919
Name: count, dtype: float64
# Remember to set all negative values to zero
predictions_nf_df = pd.DataFrame(predictions_nf)
count_neg = len(predictions_nf_df[predictions_nf_df["count"] < 0])
if count_neg > 0:
predictions_nf_df.loc[predictions_nf_df["count"] < 0, ["count"]] = 0
print("{} Negative predictions were set to zero" . format(count_neg))
print(predictions_nf_df[predictions_nf_df["count"]==0])
else:
    print("{} negative values were found".format(count_neg))
0 negative values were found
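The mask-and-assign pattern above can be collapsed into a single `Series.clip` call; a minimal sketch with hypothetical predictions:

```python
import pandas as pd

# Hypothetical predictions containing one negative value
preds = pd.Series([12.5, -3.2, 0.0, 40.1], name="count")

# clip(lower=0) floors every negative prediction at zero in one call
clipped = preds.clip(lower=0)

assert clipped.min() >= 0
assert clipped.tolist() == [12.5, 0.0, 0.0, 40.1]
```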
submission_new_features = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission_new_features.head()
|   | datetime | count |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0 |
| 1 | 2011-01-20 01:00:00 | 0 |
| 2 | 2011-01-20 02:00:00 | 0 |
| 3 | 2011-01-20 03:00:00 | 0 |
| 4 | 2011-01-20 04:00:00 | 0 |
# Save and submit predictions
submission_new_features["count"] = predictions_nf.round(0).astype(int)
submission_new_features.to_csv("submission_new_features.csv", index=False)
!kaggle competitions submit -c bike-sharing-demand -f submission_new_features.csv -m "model with new features"
100%|█████████████████████████████████████████| 149k/149k [00:00<00:00, 279kB/s]
Successfully submitted to Bike Sharing Demand
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description                      status    publicScore  privateScore
---------------------------  -------------------  -------------------------------  --------  -----------  ------------
submission_new_features.csv  2023-01-04 02:38:24  model with new features          complete  0.65341      0.65341
submission.csv               2023-01-04 02:17:18  first raw submission             complete  1.79200      1.79200
submission_hpo.csv           2023-01-04 01:59:41  model with new features and hpo  complete  0.47675      0.47675
submission_hpo.csv           2023-01-04 01:45:44  model with new features and hpo  complete  0.48014      0.48014
tail: write error: Broken pipe
# Kaggle score with one additional feature (hour): 0.67642
# Kaggle score with the additional categorical features: 0.65341
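Note that the validation scores above are RMSE (around 30), while the Kaggle leaderboard scores this competition on RMSLE, so the two numbers live on different scales. A minimal sketch of RMSLE, with toy counts to show that the log transform penalizes relative rather than absolute error:

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error: RMSE of log1p-transformed values,
    # which dampens errors on large counts
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy counts (hypothetical): the same absolute error of 10 rentals
# costs far more at low counts than at high counts
assert rmsle([10], [20]) > rmsle([200], [210])
assert rmsle([5, 10], [5, 10]) == 0.0
```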
Next, selected models are tuned via the hyperparameters and hyperparameter_tune_kwargs arguments of fit().
import autogluon.core as ag
# High-level hyperparameters:
# num_stack_levels: maximum possible is 3.
# num_bag_folds: values between 5 and 10 are recommended by AutoGluon (default = 5 with the best_quality {'auto_stack': True} preset).
# num_bag_sets: maximum possible is 20 when time_limit is set.
# hyperparameters
gbm_options = {
'num_boost_round': 200,
'num_leaves': ag.space.Int(lower=26, upper=66, default=36),
'learning_rate' : 0.03,
}
cat_options = {
'iterations' : 10000,
'learning_rate' : 0.03,
'depth' : ag.space.Int(lower=2, upper=8, default=6)
}
hyperparameters = {
'GBM': gbm_options,
'CAT': cat_options,
}
# hyperparameter_tune_kwargs
num_trials = 5 # Effective number of trials is limited by time_limit.
hyperparameter_tune_kwargs = { # HPO is not performed unless hyperparameter_tune_kwargs is specified
'num_trials': num_trials,
'scheduler' : 'local',
'searcher': 'auto', # AutoGluon performs a random search
}
learner_kwargs = {
"ignored_columns": ["casual", "registered", "atemp", "windspeed", "humidity", "temp"]
}
predictor_hpo = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression",
eval_metric="root_mean_squared_error").fit(
train_data=train,
time_limit=600,
num_stack_levels=3,
num_bag_folds=10,
num_bag_sets=20,
hyperparameters=hyperparameters,
hyperparameter_tune_kwargs=hyperparameter_tune_kwargs
)
No path specified. Models will be saved in: "AutogluonModels/ag-20230104_025616/"
Warning: hyperparameter tuning is currently experimental and may cause the process to hang.
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20230104_025616/"
AutoGluon Version: 0.6.1
Python Version: 3.7.10
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows: 10886
Train Data Columns: 17
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered', 'atemp', 'windspeed', 'humidity', 'temp']
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 1879.08 MB
Train Data (Original) Memory Usage: 0.27 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
good_rows = series[~series.isin(bad_rows)].astype(np.int64)
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('category', []) : 9 | ['season', 'holiday', 'workingday', 'weather', 'time_of_day', ...]
('datetime', []) : 1 | ['datetime']
('int', []) : 1 | ['hour']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 7 | ['season', 'weather', 'time_of_day', 'atempcat', 'tempcat', ...]
('int', []) : 1 | ['hour']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
0.2s = Fit runtime
11 features in original data used to generate 15 features in processed data.
Train Data (Processed) Memory Usage: 0.62 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.23s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 4 stack levels (L1 to L4) ...
Fitting 2 L1 models ...
Hyperparameter tuning model: LightGBM_BAG_L1 ... Tuning model for up to 89.94s of the 599.76s of remaining time.
0%| | 0/5 [00:00<?, ?it/s]
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Stopping HPO to satisfy time limit...
Fitted model: LightGBM_BAG_L1/T1 ...
	-39.6668 = Validation score (-root_mean_squared_error)
	30.37s = Training runtime
	0.0s = Validation runtime
Fitted model: LightGBM_BAG_L1/T2 ...
	-41.5783 = Validation score (-root_mean_squared_error)
	30.09s = Training runtime
	0.0s = Validation runtime
Fitted model: LightGBM_BAG_L1/T3 ...
	-36.7635 = Validation score (-root_mean_squared_error)
	32.56s = Training runtime
	0.0s = Validation runtime
Hyperparameter tuning model: CatBoost_BAG_L1 ... Tuning model for up to 89.94s of the 506.48s of remaining time.
0%| | 0/5 [00:00<?, ?it/s]
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Stopping HPO to satisfy time limit...
Fitted model: CatBoost_BAG_L1/T1 ...
	-43.2934 = Validation score (-root_mean_squared_error)
	89.41s = Training runtime
	0.0s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 416.91s of remaining time.
	-36.754 = Validation score (-root_mean_squared_error)
	0.29s = Training runtime
	0.0s = Validation runtime
Fitting 2 L2 models ...
Hyperparameter tuning model: LightGBM_BAG_L2 ... Tuning model for up to 83.29s of the 416.52s of remaining time.
0%| | 0/5 [00:00<?, ?it/s]
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Stopping HPO to satisfy time limit...
Fitted model: LightGBM_BAG_L2/T1 ...
	-36.4531 = Validation score (-root_mean_squared_error)
	31.57s = Training runtime
	0.0s = Validation runtime
Fitted model: LightGBM_BAG_L2/T2 ...
	-36.4951 = Validation score (-root_mean_squared_error)
	32.42s = Training runtime
	0.0s = Validation runtime
Hyperparameter tuning model: CatBoost_BAG_L2 ... Tuning model for up to 83.29s of the 352.34s of remaining time.
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Stopping HPO to satisfy time limit...
Fitted model: CatBoost_BAG_L2/T1 ...
-37.1213 = Validation score (-root_mean_squared_error)
84.68s = Training runtime
0.0s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the 267.48s of remaining time.
-36.3611 = Validation score (-root_mean_squared_error)
0.23s = Training runtime
0.0s = Validation runtime
Fitting 2 L3 models ...
Hyperparameter tuning model: LightGBM_BAG_L3 ... Tuning model for up to 80.13s of the 267.16s of remaining time.
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Stopping HPO to satisfy time limit...
Fitted model: LightGBM_BAG_L3/T1 ...
-37.1023 = Validation score (-root_mean_squared_error)
31.51s = Training runtime
0.0s = Validation runtime
Fitted model: LightGBM_BAG_L3/T2 ...
-36.9997 = Validation score (-root_mean_squared_error)
30.9s = Training runtime
0.0s = Validation runtime
Hyperparameter tuning model: CatBoost_BAG_L3 ... Tuning model for up to 80.13s of the 204.54s of remaining time.
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Stopping HPO to satisfy time limit...
Fitted model: CatBoost_BAG_L3/T1 ...
-36.6865 = Validation score (-root_mean_squared_error)
81.55s = Training runtime
0.0s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L4 ... Training model for up to 360.0s of the 122.83s of remaining time.
-36.664 = Validation score (-root_mean_squared_error)
0.23s = Training runtime
0.0s = Validation runtime
Fitting 2 L4 models ...
Hyperparameter tuning model: LightGBM_BAG_L4 ... Tuning model for up to 55.14s of the 122.51s of remaining time.
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Stopping HPO to satisfy time limit...
Fitted model: LightGBM_BAG_L4/T1 ...
-37.4417 = Validation score (-root_mean_squared_error)
30.68s = Training runtime
0.0s = Validation runtime
Hyperparameter tuning model: CatBoost_BAG_L4 ... Tuning model for up to 55.14s of the 91.67s of remaining time.
Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
Stopping HPO to satisfy time limit...
Fitted model: CatBoost_BAG_L4/T1 ...
-37.139 = Validation score (-root_mean_squared_error)
61.62s = Training runtime
0.0s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L5 ... Training model for up to 360.0s of the 29.88s of remaining time.
-37.1079 = Validation score (-root_mean_squared_error)
0.18s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 570.51s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230104_025616/")
#predictor_hpo = TabularPredictor.load("AutogluonModels/.../")
# Get detailed info of the predictor
pred_hpo_info = predictor_hpo.info()
with open('docs/pred_hpo_info.json', 'w') as convert_file:
convert_file.write(json.dumps(pred_hpo_info, default=str))
predictor_hpo.fit_summary(show_plot=False)
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -36.361050 0.001702 331.316357 0.000933 0.233781 3 True 9
1 LightGBM_BAG_L2/T1 -36.453077 0.000551 213.984909 0.000095 31.565502 2 True 6
2 LightGBM_BAG_L2/T2 -36.495067 0.000588 214.839080 0.000131 32.419674 2 True 7
3 WeightedEnsemble_L4 -36.663953 0.002069 475.266736 0.000921 0.228529 4 True 13
4 CatBoost_BAG_L3/T1 -36.686524 0.000893 412.627928 0.000124 81.545351 3 True 12
5 WeightedEnsemble_L2 -36.753990 0.000954 122.255102 0.000747 0.292852 2 True 5
6 LightGBM_BAG_L1/T3 -36.763544 0.000098 32.556703 0.000098 32.556703 1 True 3
7 LightGBM_BAG_L3/T2 -36.999679 0.000899 361.981402 0.000131 30.898825 3 True 11
8 LightGBM_BAG_L3/T1 -37.102334 0.000894 362.594031 0.000125 31.511455 3 True 10
9 WeightedEnsemble_L5 -37.107944 0.002198 567.524043 0.000823 0.179885 5 True 16
10 CatBoost_BAG_L2/T1 -37.121271 0.000542 267.097401 0.000086 84.677994 2 True 8
11 CatBoost_BAG_L4/T1 -37.139012 0.001281 536.662032 0.000132 61.623824 4 True 15
12 LightGBM_BAG_L4/T1 -37.441667 0.001242 505.720334 0.000093 30.682127 4 True 14
13 LightGBM_BAG_L1/T1 -39.666828 0.000154 30.368811 0.000154 30.368811 1 True 1
14 LightGBM_BAG_L1/T2 -41.578265 0.000095 30.088346 0.000095 30.088346 1 True 2
15 CatBoost_BAG_L1/T1 -43.293445 0.000110 89.405546 0.000110 89.405546 1 True 4
Number of models trained: 16
Types of models trained:
{'StackerEnsembleModel_LGB', 'StackerEnsembleModel_CatBoost', 'WeightedEnsembleModel'}
Bagging used: True (with 10 folds)
Multi-layer stack-ensembling used: True (with 5 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 7 | ['season', 'weather', 'time_of_day', 'atempcat', 'tempcat', ...]
('int', []) : 1 | ['hour']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20230104_025616/SummaryOfModels.html
*** End of fit() summary ***
{'model_types': {'LightGBM_BAG_L1/T1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L1/T2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L1/T3': 'StackerEnsembleModel_LGB',
'CatBoost_BAG_L1/T1': 'StackerEnsembleModel_CatBoost',
'WeightedEnsemble_L2': 'WeightedEnsembleModel',
'LightGBM_BAG_L2/T1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L2/T2': 'StackerEnsembleModel_LGB',
'CatBoost_BAG_L2/T1': 'StackerEnsembleModel_CatBoost',
'WeightedEnsemble_L3': 'WeightedEnsembleModel',
'LightGBM_BAG_L3/T1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L3/T2': 'StackerEnsembleModel_LGB',
'CatBoost_BAG_L3/T1': 'StackerEnsembleModel_CatBoost',
'WeightedEnsemble_L4': 'WeightedEnsembleModel',
'LightGBM_BAG_L4/T1': 'StackerEnsembleModel_LGB',
'CatBoost_BAG_L4/T1': 'StackerEnsembleModel_CatBoost',
'WeightedEnsemble_L5': 'WeightedEnsembleModel'},
'model_performance': {'LightGBM_BAG_L1/T1': -39.666828152831606,
'LightGBM_BAG_L1/T2': -41.578265197796064,
'LightGBM_BAG_L1/T3': -36.76354412409187,
'CatBoost_BAG_L1/T1': -43.29344521141029,
'WeightedEnsemble_L2': -36.75399003349142,
'LightGBM_BAG_L2/T1': -36.45307681065257,
'LightGBM_BAG_L2/T2': -36.49506683889113,
'CatBoost_BAG_L2/T1': -37.12127084525256,
'WeightedEnsemble_L3': -36.361050242497605,
'LightGBM_BAG_L3/T1': -37.10233449608509,
'LightGBM_BAG_L3/T2': -36.99967884689942,
'CatBoost_BAG_L3/T1': -36.686524245722254,
'WeightedEnsemble_L4': -36.66395318572601,
'LightGBM_BAG_L4/T1': -37.44166702766272,
'CatBoost_BAG_L4/T1': -37.13901210163851,
'WeightedEnsemble_L5': -37.10794399329098},
'model_best': 'WeightedEnsemble_L3',
'model_paths': {'LightGBM_BAG_L1/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L1/T1/',
'LightGBM_BAG_L1/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L1/T2/',
'LightGBM_BAG_L1/T3': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L1/T3/',
'CatBoost_BAG_L1/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/CatBoost_BAG_L1/T1/',
'WeightedEnsemble_L2': 'AutogluonModels/ag-20230104_025616/models/WeightedEnsemble_L2/',
'LightGBM_BAG_L2/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L2/T1/',
'LightGBM_BAG_L2/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L2/T2/',
'CatBoost_BAG_L2/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/CatBoost_BAG_L2/T1/',
'WeightedEnsemble_L3': 'AutogluonModels/ag-20230104_025616/models/WeightedEnsemble_L3/',
'LightGBM_BAG_L3/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L3/T1/',
'LightGBM_BAG_L3/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L3/T2/',
'CatBoost_BAG_L3/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/CatBoost_BAG_L3/T1/',
'WeightedEnsemble_L4': 'AutogluonModels/ag-20230104_025616/models/WeightedEnsemble_L4/',
'LightGBM_BAG_L4/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L4/T1/',
'CatBoost_BAG_L4/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/CatBoost_BAG_L4/T1/',
'WeightedEnsemble_L5': 'AutogluonModels/ag-20230104_025616/models/WeightedEnsemble_L5/'},
'model_fit_times': {'LightGBM_BAG_L1/T1': 30.36881136894226,
'LightGBM_BAG_L1/T2': 30.088345527648926,
'LightGBM_BAG_L1/T3': 32.556703329086304,
'CatBoost_BAG_L1/T1': 89.40554642677307,
'WeightedEnsemble_L2': 0.29285240173339844,
'LightGBM_BAG_L2/T1': 31.565502166748047,
'LightGBM_BAG_L2/T2': 32.419673681259155,
'CatBoost_BAG_L2/T1': 84.67799425125122,
'WeightedEnsemble_L3': 0.2337806224822998,
'LightGBM_BAG_L3/T1': 31.511454582214355,
'LightGBM_BAG_L3/T2': 30.89882493019104,
'CatBoost_BAG_L3/T1': 81.54535126686096,
'WeightedEnsemble_L4': 0.22852873802185059,
'LightGBM_BAG_L4/T1': 30.682126760482788,
'CatBoost_BAG_L4/T1': 61.62382435798645,
'WeightedEnsemble_L5': 0.179884672164917},
'model_pred_times': {'LightGBM_BAG_L1/T1': 0.0001537799835205078,
'LightGBM_BAG_L1/T2': 9.5367431640625e-05,
'LightGBM_BAG_L1/T3': 9.751319885253906e-05,
'CatBoost_BAG_L1/T1': 0.00010991096496582031,
'WeightedEnsemble_L2': 0.0007467269897460938,
'LightGBM_BAG_L2/T1': 9.465217590332031e-05,
'LightGBM_BAG_L2/T2': 0.00013136863708496094,
'CatBoost_BAG_L2/T1': 8.58306884765625e-05,
'WeightedEnsemble_L3': 0.0009334087371826172,
'LightGBM_BAG_L3/T1': 0.00012540817260742188,
'LightGBM_BAG_L3/T2': 0.00013065338134765625,
'CatBoost_BAG_L3/T1': 0.00012445449829101562,
'WeightedEnsemble_L4': 0.0009205341339111328,
'LightGBM_BAG_L4/T1': 9.298324584960938e-05,
'CatBoost_BAG_L4/T1': 0.00013208389282226562,
'WeightedEnsemble_L5': 0.0008234977722167969},
'num_bag_folds': 10,
'max_stack_level': 5,
'model_hyperparams': {'LightGBM_BAG_L1/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L1/T2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L1/T3': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'CatBoost_BAG_L1/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L2': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2/T2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'CatBoost_BAG_L2/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L3': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L3/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L3/T2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'CatBoost_BAG_L3/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L4': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L4/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'CatBoost_BAG_L4/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L5': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True}},
'leaderboard': model score_val pred_time_val fit_time \
0 WeightedEnsemble_L3 -36.361050 0.001702 331.316357
1 LightGBM_BAG_L2/T1 -36.453077 0.000551 213.984909
2 LightGBM_BAG_L2/T2 -36.495067 0.000588 214.839080
3 WeightedEnsemble_L4 -36.663953 0.002069 475.266736
4 CatBoost_BAG_L3/T1 -36.686524 0.000893 412.627928
5 WeightedEnsemble_L2 -36.753990 0.000954 122.255102
6 LightGBM_BAG_L1/T3 -36.763544 0.000098 32.556703
7 LightGBM_BAG_L3/T2 -36.999679 0.000899 361.981402
8 LightGBM_BAG_L3/T1 -37.102334 0.000894 362.594031
9 WeightedEnsemble_L5 -37.107944 0.002198 567.524043
10 CatBoost_BAG_L2/T1 -37.121271 0.000542 267.097401
11 CatBoost_BAG_L4/T1 -37.139012 0.001281 536.662032
12 LightGBM_BAG_L4/T1 -37.441667 0.001242 505.720334
13 LightGBM_BAG_L1/T1 -39.666828 0.000154 30.368811
14 LightGBM_BAG_L1/T2 -41.578265 0.000095 30.088346
15 CatBoost_BAG_L1/T1 -43.293445 0.000110 89.405546
pred_time_val_marginal fit_time_marginal stack_level can_infer \
0 0.000933 0.233781 3 True
1 0.000095 31.565502 2 True
2 0.000131 32.419674 2 True
3 0.000921 0.228529 4 True
4 0.000124 81.545351 3 True
5 0.000747 0.292852 2 True
6 0.000098 32.556703 1 True
7 0.000131 30.898825 3 True
8 0.000125 31.511455 3 True
9 0.000823 0.179885 5 True
10 0.000086 84.677994 2 True
11 0.000132 61.623824 4 True
12 0.000093 30.682127 4 True
13 0.000154 30.368811 1 True
14 0.000095 30.088346 1 True
15 0.000110 89.405546 1 True
fit_order
0 9
1 6
2 7
3 13
4 12
5 5
6 3
7 11
8 10
9 16
10 8
11 15
12 14
13 1
14 2
15 4 }
predictor_hpo.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
<AxesSubplot:xlabel='model'>
# Save validation scores
leaderboard_hpo = predictor_hpo.leaderboard()
leaderboard_hpo["description"] = "hpo scores"
leaderboard_hpo.to_csv("docs/leaderboard_hpo.csv", index=False)
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -36.361050 0.001702 331.316357 0.000933 0.233781 3 True 9
1 LightGBM_BAG_L2/T1 -36.453077 0.000551 213.984909 0.000095 31.565502 2 True 6
2 LightGBM_BAG_L2/T2 -36.495067 0.000588 214.839080 0.000131 32.419674 2 True 7
3 WeightedEnsemble_L4 -36.663953 0.002069 475.266736 0.000921 0.228529 4 True 13
4 CatBoost_BAG_L3/T1 -36.686524 0.000893 412.627928 0.000124 81.545351 3 True 12
5 WeightedEnsemble_L2 -36.753990 0.000954 122.255102 0.000747 0.292852 2 True 5
6 LightGBM_BAG_L1/T3 -36.763544 0.000098 32.556703 0.000098 32.556703 1 True 3
7 LightGBM_BAG_L3/T2 -36.999679 0.000899 361.981402 0.000131 30.898825 3 True 11
8 LightGBM_BAG_L3/T1 -37.102334 0.000894 362.594031 0.000125 31.511455 3 True 10
9 WeightedEnsemble_L5 -37.107944 0.002198 567.524043 0.000823 0.179885 5 True 16
10 CatBoost_BAG_L2/T1 -37.121271 0.000542 267.097401 0.000086 84.677994 2 True 8
11 CatBoost_BAG_L4/T1 -37.139012 0.001281 536.662032 0.000132 61.623824 4 True 15
12 LightGBM_BAG_L4/T1 -37.441667 0.001242 505.720334 0.000093 30.682127 4 True 14
13 LightGBM_BAG_L1/T1 -39.666828 0.000154 30.368811 0.000154 30.368811 1 True 1
14 LightGBM_BAG_L1/T2 -41.578265 0.000095 30.088346 0.000095 30.088346 1 True 2
15 CatBoost_BAG_L1/T1 -43.293445 0.000110 89.405546 0.000110 89.405546 1 True 4
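Since each training run writes its own leaderboard CSV, it can be convenient to stack them into one frame for side-by-side comparison. A hedged sketch with toy in-memory frames — the run names and scores here are illustrative stand-ins, not read from the actual `docs/` files:

```python
import pandas as pd

# Illustrative per-run leaderboards (assumption: in practice these would be
# pd.read_csv("docs/leaderboard_<run>.csv") for each run).
runs = {
    "initial": pd.DataFrame({"model": ["WeightedEnsemble_L3"], "score_val": [-53.07]}),
    "hpo": pd.DataFrame({"model": ["WeightedEnsemble_L3"], "score_val": [-36.36]}),
}

# Tag each leaderboard with its run name, then concatenate into one frame.
combined = pd.concat(
    [df.assign(run=name) for name, df in runs.items()], ignore_index=True
)
```

The `run` column then lets you filter or pivot the combined frame when building the comparison tables for the writeup.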
predictions_hpo = predictor_hpo.predict(test)
predictions_hpo.head()
0    11.541133
1     6.640240
2     6.064966
3     5.733857
4     5.725714
Name: count, dtype: float32
predictions_hpo.describe()
count    6493.000000
mean      191.760544
std       172.945343
min         5.417264
25%        46.607738
50%       151.285156
75%       283.315765
max       867.180420
Name: count, dtype: float64
# Remember to set any negative predictions to zero
predictions_hpo_rev = pd.DataFrame(predictions_hpo)
count_neg = len(predictions_hpo_rev[predictions_hpo_rev["count"] < 0])
if count_neg > 0:
    predictions_hpo_rev.loc[predictions_hpo_rev["count"] < 0, ["count"]] = 0
    print("{} negative predictions were set to zero".format(count_neg))
    print(predictions_hpo_rev[predictions_hpo_rev["count"] == 0])
else:
    print("{} negative values were found".format(count_neg))
0 negative values were found
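The loop above works; for reference, pandas also offers a one-call alternative via `Series.clip`. A hedged sketch on toy values (the real `predictions_hpo` Series is not reproduced here):

```python
import pandas as pd

# Toy predictions for illustration only -- the real Series comes from
# predictor_hpo.predict(test).
preds = pd.Series([11.5, -2.3, 6.1, -0.4], name="count")

# clip(lower=0) replaces every negative value with 0 and leaves the rest intact.
clipped = preds.clip(lower=0)
```

This keeps the cell to a single expression and avoids the explicit boolean-mask assignment, at the cost of not reporting how many values were clipped.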
submission_hpo = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission_hpo.head()
|   | datetime | count |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0 |
| 1 | 2011-01-20 01:00:00 | 0 |
| 2 | 2011-01-20 02:00:00 | 0 |
| 3 | 2011-01-20 03:00:00 | 0 |
| 4 | 2011-01-20 04:00:00 | 0 |
# Same process for submitting the predictions as before
submission_hpo["count"] = predictions_hpo_rev.round(0).astype(int)
submission_hpo.to_csv("submission_hpo.csv", index=False)
submission_hpo.describe()
|   | count |
|---|---|
| count | 6493.000000 |
| mean | 191.763130 |
| std | 172.935337 |
| min | 5.000000 |
| 25% | 47.000000 |
| 50% | 151.000000 |
| 75% | 283.000000 |
| max | 867.000000 |
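Before submitting, a quick consistency check between the sample file and the predictions can catch length mismatches or stray NaNs. A hedged sketch on synthetic rows — the real `sampleSubmission.csv` has 6493 rows; the four rows and values below are illustrative only:

```python
import pandas as pd

# Synthetic stand-ins for sampleSubmission.csv and the model predictions.
sample = pd.DataFrame({
    "datetime": pd.date_range("2011-01-20", periods=4, freq="h"),
    "count": [0, 0, 0, 0],
})
preds = pd.Series([11.5, 6.6, 6.1, 5.7])

# The submission must line up row-for-row with the sample file.
assert len(preds) == len(sample), "prediction/sample length mismatch"
assert not preds.isna().any(), "NaN predictions would be rejected"

# Round to whole rentals, as in the cell above.
submission = sample.assign(count=preds.round(0).astype(int))
```

The same two assertions can be dropped into the notebook just before `submission_hpo.to_csv(...)` with the real frames substituted in.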
!kaggle competitions submit -c bike-sharing-demand -f submission_hpo.csv -m "model with new features and hpo"
100%|█████████████████████████████████████████| 149k/149k [00:00<00:00, 263kB/s] Successfully submitted to Bike Sharing Demand
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description                      status    publicScore  privateScore
---------------------------  -------------------  -------------------------------  --------  -----------  ------------
submission_hpo.csv           2023-01-04 03:46:50  model with new features and hpo  complete  0.47562      0.47562
submission_new_features.csv  2023-01-04 02:38:24  model with new features          complete  0.65341      0.65341
submission.csv               2023-01-04 02:17:18  first raw submission             complete  1.79200      1.79200
submission_hpo.csv           2023-01-04 01:59:41  model with new features and hpo  complete  0.47675      0.47675
tail: write error: Broken pipe
# Score (high-level hyperparameters only): 0.62542
# Score (high-level hyperparameters plus per-model hyperparameters and hyperparameter_tune_kwargs): 0.47562
# Taking the top model score from each training run and creating a line plot to show improvement
# You can create these in the notebook and save them to PNG or use some other tool (e.g. google sheets, excel)
fig = pd.DataFrame(
{
"model": ["initial", "add_features", "hpo"],
"score": [53.073174, 30.554656, 36.361050]
}
).plot(x="model", y="score", figsize=(8, 6)).get_figure()
fig.savefig('img/model_train_score.png')
# Taking the 3 Kaggle scores and creating a line plot to show improvement
fig = pd.DataFrame(
{
"test_eval": ["initial", "add_features", "hpo"],
"score": [1.79200, 0.65341, 0.47562]
}
).plot(x="test_eval", y="score", figsize=(8, 6)).get_figure()
fig.savefig('img/model_test_score.png')
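As an alternative to two separate PNGs, both score curves can go side by side in one figure. This is a hedged sketch, not the notebook's approach; the output file name `model_scores.png` is an assumption:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Same three runs and scores as the two cells above.
train = pd.DataFrame({"model": ["initial", "add_features", "hpo"],
                      "score": [53.073174, 30.554656, 36.361050]})
test_eval = pd.DataFrame({"model": ["initial", "add_features", "hpo"],
                          "score": [1.79200, 0.65341, 0.47562]})

# One figure, two axes: validation RMSE on the left, Kaggle score on the right.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
train.plot(x="model", y="score", ax=ax1, title="Best validation RMSE per run")
test_eval.plot(x="model", y="score", ax=ax2, title="Kaggle test score per run")
fig.savefig("model_scores.png")  # hypothetical combined output file
```

A combined figure makes the contrast between the two metrics easier to see in the writeup: validation RMSE worsened slightly from add_features to hpo while the Kaggle score kept improving.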
# The 3 hyperparameters we tuned, with the Kaggle score as the result
pd.DataFrame({
"model": ["initial", "add_features", "hpo"],
"num_stack_levels": [3, 3, 5],
"num_bag_folds": [8, 8, 10],
"num_bag_sets": [20, 20, 20],
"score": [1.79200, 0.65341, 0.47562]
})
|   | model | num_stack_levels | num_bag_folds | num_bag_sets | score |
|---|---|---|---|---|---|
| 0 | initial | 3 | 8 | 20 | 1.79200 |
| 1 | add_features | 3 | 8 | 20 | 0.65341 |
| 2 | hpo | 5 | 10 | 20 | 0.47562 |
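The Kaggle score dropped with each run; a short follow-up sketch (a toy recomputation from the table above, not a cell in the original notebook) quantifies the relative improvement between successive runs:

```python
import pandas as pd

# The three Kaggle scores from the table above, indexed by run name.
scores = pd.Series([1.79200, 0.65341, 0.47562],
                   index=["initial", "add_features", "hpo"])

# Percent change vs the previous run; negative means the score (error) dropped.
improvement = scores.pct_change() * 100
```

Adding features cut the score by roughly 64%, and hyperparameter tuning cut it by a further 27% or so: numbers worth quoting directly in the writeup.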